Background: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

Results: Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.

Conclusions: Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
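
To make the reported measures concrete, the following is a minimal sketch of a dictionary look-up baseline and of strict versus lenient mention-level scoring. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names `dictionary_lookup` and `score`, the character-offset span representation, and the reading of "lenient" as any overlap between a predicted and a gold span are all assumptions introduced here.

```python
import re
from typing import List, Tuple

# A mention is a (start, end) pair of character offsets, end exclusive.
# This representation is an assumption made for illustration.
Span = Tuple[int, int]


def dictionary_lookup(text: str, names: List[str]) -> List[Span]:
    """Return spans of exact, case-sensitive dictionary name matches.

    Simplification: word boundaries are required on both sides, so names
    that begin or end with a non-word character may be missed.
    """
    spans = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], predicted: List[Span], lenient: bool = False):
    """Compute (precision, recall, F-score) at the mention level.

    Strict matching requires identical boundaries; lenient matching is
    assumed here to mean any overlap with a gold mention.
    """
    if lenient:
        is_match = overlaps
    else:
        is_match = lambda a, b: a == b

    # Predicted mentions that match some gold mention (for precision).
    matched_pred = sum(any(is_match(p, g) for g in gold) for p in predicted)
    # Gold mentions that are matched by some prediction (for recall).
    matched_gold = sum(any(is_match(g, p) for p in predicted) for g in gold)

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: the dictionary holds "Clustal", while the
# annotated mention in the text is the variant "Clustal W".
text = "We used BLAST and Clustal W."
gold = [(8, 13), (18, 27)]
predicted = dictionary_lookup(text, ["BLAST", "Clustal"])
strict_result = score(gold, predicted)
lenient_result = score(gold, predicted, lenient=True)
```

In this toy example the partial match on "Clustal W" counts as an error under strict matching and as a hit under lenient matching, which is the kind of name variability that separates the two sets of figures reported above.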